Abstract

We describe a planning algorithm that integrates two approaches
to solving Markov decision processes with large
state spaces. State abstraction is used to avoid evaluating
states individually. Forward search from a start state, guided
by an admissible heuristic, is used to avoid evaluating all
states. We combine these two approaches in a novel way that
exploits symbolic model-checking techniques and demonstrates
their usefulness for decision-theoretic planning.

Introduction 

Markov decision processes (MDPs) have been adopted as a 
framework for research in decision-theoretic planning. Clas- 
sic dynamic programming algorithms solve MDPs in time 
polynomial in the size of the state space. However, the 
size of the state space grows exponentially with the number 
of features describing the problem. This “state explosion” 
problem limits use of the MDP framework, and overcoming 
it has become an important topic of research. 

Over the past several years, approaches to solving MDPs 
that do not rely on complete state enumeration have been 
developed. One approach exploits a feature-based (or fac- 
tored) representation of an MDP to create state abstractions 
that allow the problem to be represented and solved more 
efficiently (Dearden & Boutilier 1997; Hoey et al. 1999; 
and many others). Another approach limits computation 
to states that are reachable from the starting state(s) of the 
MDP (Barto, Bradtke, & Singh 1995; Dean et al. 1995; 
Hansen & Zilberstein 2001). In this paper, we show how 
to combine these two approaches. Moreover, we do so in 
a way that demonstrates the usefulness of symbolic model- 
checking techniques for decision-theoretic planning. 

There is currently great interest in using symbolic model- 
checking to solve artificial intelligence planning problems 
with large state spaces. This interest is based on recognition 
that the problem of finding a plan (i.e., a sequence of ac- 
tions and states leading from an initial state to a goal state) 
can be treated as a reachability problem in model checking. 
The planning problem is solved symbolically in the sense 
that reachability analysis is performed by manipulating sets 
of states, rather than individual states, alleviating the state 
explosion problem. Recent work has shown that symbolic
model checking provides a very effective approach to non- 
deterministic planning (Cimatti, Roveri, & Traverso 1998). 
Nondeterministic planning is similar to decision-theoretic 
planning in that it considers actions with multiple outcomes, 
allows plan execution to include conditional and iterative be- 
havior, and represents a plan as a mapping from states to 
actions. However, decision-theoretic planning is more complex
than nondeterministic planning because it associates
probabilities and rewards with state transitions. The problem
is not simply to construct a plan that reaches the goal,
but to find a plan that maximizes expected value. 

The first use of symbolic model checking for decision-theoretic
planning is the SPUDD planner (Hoey et al. 1999),
which solves factored MDPs using dynamic programming. 
Although it does not use reachability analysis, it uses sym- 
bolic model-checking techniques to perform dynamic pro- 
gramming efficiently on sets of states. Our paper builds 
heavily on this work, but extends it in an important way. 
Whereas dynamic programming solves an MDP for all pos- 
sible starting states, we use knowledge of the starting state(s) 
to limit planning to reachable states. In essence, we show 
how to improve the efficiency of the SPUDD framework by 
using symbolic reachability analysis together with dynamic 
programming. Our algorithm can also be viewed as a gen- 
eralization of heuristic-search planners for MDPs, in which 
our contribution is to show how to perform heuristic search 
symbolically for factored MDPs. The advantage of heuristic 
search over dynamic programming is that it focuses planning 
resources on the relevant parts of the state space. 

Factored MDPs and decision diagrams 

A Markov decision process (MDP) is defined as a tuple
$(S, A, P, R)$ where: $S$ is a set of states; $A$ is a set of actions;
$P$ is a set of transition models $P^a : S \times S \to [0, 1]$,
one for each action, specifying the transition probabilities of
the process; and $R$ is a set of reward models $R^a : S \to \Re$,
one for each action, specifying the expected reward for taking
action $a$ in each state. We consider MDPs for which the
objective is to find a policy $\pi : S \to A$ that maximizes total
discounted reward over an infinite (or indefinite) horizon,
where $\gamma \in [0, 1]$ is the discount factor. (We allow a discount
factor of 1 for indefinite-horizon problems only, that is, for
MDPs that terminate after a goal state is reached.)
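
For reference, the objective above can be stated through the standard Bellman optimality equation. This is a textbook identity written in the notation of this section, where $V^*$ and $\pi^*$ (symbols introduced here for illustration) denote the optimal value function and an optimal policy:

```latex
V^*(s) = \max_{a \in A} \Big[ R^a(s) + \gamma \sum_{s' \in S} P^a(s, s')\, V^*(s') \Big],
\qquad
\pi^*(s) \in \arg\max_{a \in A} \Big[ R^a(s) + \gamma \sum_{s' \in S} P^a(s, s')\, V^*(s') \Big].
```

Dynamic programming algorithms such as value iteration apply the right-hand side repeatedly as an update rule.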
In a factored MDP, the set of states is described by a set of
random variables $X = \{X_1, \ldots, X_n\}$. Without loss of generality,
we assume these are Boolean variables. A particular
instantiation of the variables corresponds to a unique state.
Because the set of states $S = 2^X$ grows exponentially with
the number of variables, it is impractical to represent the
transition and reward models explicitly as matrices when the
number of state variables is large. Instead we follow Hoey
et al. (1999) in using algebraic decision diagrams to achieve
a more compact representation.

Algebraic decision diagrams (ADDs) are a generalization 
of binary decision diagrams (BDDs), a compact data struc- 
ture for Boolean functions used in symbolic model checking. 
A decision diagram is a data structure (corresponding to an 
acyclic directed graph) that compactly represents a mapping 
from a set of Boolean state variables to a set of values. A 
BDD represents a mapping to the values 0 or 1. An ADD 
represents a mapping to any finite set of values. To repre- 
sent these mappings compactly, decision diagrams exploit 
the fact that many instantiations of the state variables map 
to the same value. In other words, decision diagrams ex- 
ploit state abstraction. BDDs are typically used to represent 
the characteristic functions of sets of states and the tran- 
sition functions of finite-state automata. ADDs can repre- 
sent weighted finite-state automata, where the weights cor- 
respond to transition probabilities or rewards, and thus are 
an ideal representation for MDPs. 
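
To make the idea of state abstraction concrete, the following is a minimal Python sketch of a reduced, ordered decision diagram with real-valued leaves. It is an illustration only: the tuple-based node layout and the names `build_add` and `evaluate` are assumptions made here, and the construction enumerates every assignment, which production decision-diagram packages avoid.

```python
def build_add(f, n):
    """Build a reduced ordered decision diagram for f: {0,1}^n -> value.

    Nodes are tuples ("leaf", value) or ("node", var, low, high).
    The `unique` table shares structurally identical subgraphs.
    """
    unique = {}

    def mk(key):
        # Return the canonical copy of a node, creating it if needed.
        return unique.setdefault(key, key)

    def build(prefix):
        if len(prefix) == n:
            return mk(("leaf", f(*prefix)))
        low = build(prefix + (0,))
        high = build(prefix + (1,))
        if low is high:
            # Both branches agree, so this variable need not be tested.
            return low
        return mk(("node", len(prefix), low, high))

    return build(()), unique


def evaluate(node, assignment):
    """Follow one path from the root to a leaf and return its value."""
    while node[0] == "node":
        _, var, low, high = node
        node = high if assignment[var] else low
    return node[1]


# Hypothetical value function over three Boolean variables:
# 8 states, but only 3 distinct values, and x2 is irrelevant.
def value(x0, x1, x2):
    return 10 if x0 else (5 if x1 else 0)


root, nodes = build_add(value, 3)
node_count = len(nodes)  # 2 internal nodes + 3 leaves
```

In this example the eight states collapse to five nodes, because states that map to the same value share a leaf and the irrelevant variable is never tested.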

Hoey et al. (1999) describe how to represent the transition
and reward models of a factored MDP compactly using
ADDs. We adopt their notation and refer to their paper for
details of this representation. Let $X = \{X_1, \ldots, X_n\}$ represent
the state variables at the current time and let $X' =
\{X'_1, \ldots, X'_n\}$ represent the state variables at the next step.
For each action, an ADD $P^a(X, X')$ represents the transition
probabilities for the action. Similarly, the reward model
$R^a(X)$ for each action $a$ is represented by an ADD. The
advantage of using ADDs to represent mappings from states
(and state transitions) to values is that the complexity of op- 
erators on ADDs depends on the number of nodes in the 
diagrams, not the size of the state space. If there is sufficient 
regularity in the model, ADDs can be very compact, allow- 
ing problems with large state spaces to be represented and 
solved efficiently. 

Symbolic LAO* algorithm 

To solve factored MDPs, we describe a symbolic generaliza- 
tion of the LAO* algorithm (Hansen & Zilberstein 2001). 
LAO* is an extension of the classic search algorithm AO* 
that can find solutions with loops. This makes it possible 
for LAO* to solve MDPs, since a policy for an infinite- 
horizon MDP allows both conditional and cyclic behavior. 
Like AO*, LAO* has two alternating phases. First, it ex- 
pands the best partial solution (or policy) and evaluates the 
states on its fringe using an admissible heuristic function. 
Then it performs dynamic programming on the states vis- 
ited by the best partial solution, to update their values and 
possibly revise the currently best partial solution. The two 
phases alternate until a complete solution is found, which is 
guaranteed to be optimal. 


AO* and LAO* differ in the algorithms they use in the dy- 
namic programming step. Because AO* assumes an acyclic 
solution, it can perform dynamic programming in a single 
backward pass from the states on the fringe of the solution 
to the start state. Because LAO* allows solutions with cy- 
cles, it relies on an iterative dynamic programming algo- 
rithm (such as value iteration or policy iteration). In orga- 
nization, the LAO* algorithm is similar to the “envelope” 
dynamic programming approach to solving MDPs (Dean et
al. 1995). It is also closely related to RTDP (Barto, Bradtke,
& Singh 1995), which is an on-line (or “real time”) search 
algorithm for MDPs, in contrast to LAO*, which is an off- 
line search algorithm. 

We call our generalization of LAO* a symbolic search algorithm
because it manipulates sets of states, instead of individual
states. In keeping with the symbolic model-checking
approach, we represent a set of states $S$ by its characteristic
function $\chi_S$, so that $s \in S \iff \chi_S(s) = 1$. We represent
the characteristic function of a set of states by an ADD.
(Because its values are 0 or 1, we can also represent a characteristic
function by a BDD.) From now on, whenever we
refer to a set of states, $S$, we implicitly refer to its characteristic
function, as represented by a decision diagram.

In addition to representing sets of states as ADDs, we represent
every element manipulated by the LAO* algorithm as
an ADD, including: the transition and reward models; the
policy $\pi : S \to A$; the state evaluation function $V : S \to \Re$
that is computed in the course of finding a policy; and an admissible
heuristic evaluation function $h : S \to \Re$ that guides
the search for the best policy. Even the discount factor $\gamma$ is
represented by a simple ADD that maps every input to a
constant value. This allows us to perform all computations
of the LAO* algorithm using ADDs.

Besides exploiting state abstraction, we want to limit 
computation to the set of states that are reachable from the 
start state by following the best policy. Although an ADD 
effectively assigns a value to every state, these values are 
only relevant for the set of reachable states. To focus com- 
putation on the relevant states, we introduce the notion of 
masking an ADD. Given an ADD $D$ and a set of relevant
states $U$, masking is performed by multiplying $D$ by $\chi_U$.
This has the effect of mapping all irrelevant states to the
value zero. We let $D_U$ denote the resulting masked ADD.
(Note that we need to have $U$ in order to correctly interpret
$D_U$.) Mapping all irrelevant states to zero can simplify the
ADD considerably. If the set of reachable states is small, the
masked ADD often has dramatically fewer nodes. This in
turn can dramatically improve the efficiency of computation
using ADDs.¹
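
The masking operation can be sketched with explicit tables standing in for ADDs. This is an illustrative Python sketch, not the symbolic implementation: the helper names `characteristic` and `mask` and the example function are assumptions made here, and the number of distinct values is used only as a rough proxy for the number of leaves in the corresponding ADD.

```python
from itertools import product


def characteristic(states, subset):
    """chi_U as an explicit table: 1 for states in U, 0 otherwise."""
    return {s: 1 if s in subset else 0 for s in states}


def mask(D, chi_U):
    """Pointwise product D * chi_U; irrelevant states map to zero."""
    return {s: D[s] * chi_U[s] for s in D}


states = list(product((0, 1), repeat=3))
D = {s: 4 * s[0] + 2 * s[1] + s[2] for s in states}  # every state has its own value
U = {(0, 0, 1), (0, 1, 1)}                           # hypothetical relevant states
D_U = mask(D, characteristic(states, U))

distinct_before = len(set(D.values()))   # one leaf per state
distinct_after = len(set(D_U.values()))  # zero plus the values on U
```

Here the unmasked function needs eight distinct leaf values, while the masked one needs only three.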

Our symbolic implementation of LAO* does not maintain
an explicit search graph. It is sufficient to keep track of the
set of states that have been “expanded” so far, denoted $G$,
the partial value function, denoted $V_G$, and a partial policy,


¹ Although we map the values of irrelevant states to zero, it does
not matter what value they have. This suggests a way to simplify
a masked ADD further. After mapping irrelevant states to zero,
we can change the value of an irrelevant state to any other non-zero
value whenever doing so further simplifies the ADD.


policyExpansion($\pi$, $S^0$, $G$)

1. $E = F = \emptyset$
2. $\mathit{from} = S^0$
3. REPEAT
4.     $\mathit{to} = \bigcup_a \mathrm{Image}(\mathit{from} \cap S^a_\pi, P^a)$
5.     $F = F \cup (\mathit{to} - G)$
6.     $E = E \cup \mathit{from}$
7.     $\mathit{from} = \mathit{to} \cap G - E$
8. UNTIL ($\mathit{from} = \emptyset$)
9. $E = E \cup F$
10. $G = G \cup F$
11. RETURN $(E, F, G)$

valueIteration($E$, $V$)

12. $\mathit{saveV} = V$
13. $E' = \bigcup_a \mathrm{Image}(E, P^a)$
14. REPEAT
15.     $V' = V$
16.     FOR each action $a$
17.         $V^a = R^a_E + \gamma \sum_{E'} P^a_{E \cup E'} V'_{E'}$
18.     $M = \max_a V^a$
19.     $V = M_E + \mathit{saveV}_{\bar{E}}$
20.     $\mathit{residual} = \| V_E - V'_E \|$
21. UNTIL stopping criterion met
22. $\pi = \mathrm{extractPolicy}(M, \{V^a\})$
23. RETURN $(V, \pi, \mathit{residual})$

LAO*($\{P^a\}$, $\{R^a\}$, $\gamma$, $S^0$, $h$, $\mathit{threshold}$)

24. $V = h$
25. $G = \emptyset$
26. $\pi = \emptyset$
27. REPEAT
28.     $(E, F, G) = \mathrm{policyExpansion}(\pi, S^0, G)$
29.     $(V, \pi, \mathit{residual}) = \mathrm{valueIteration}(E, V)$
30. UNTIL ($F = \emptyset$) AND ($\mathit{residual} < \mathit{threshold}$)
31. RETURN $(\pi, V, E, G)$.

Table 1: Symbolic LAO* algorithm.


denoted $\pi_G$. For any state in $G$, we can “query” the policy
to determine its associated action, and compute its successor
states. Thus, the graph structure is implicit in this representation.
Note that throughout the whole LAO* algorithm, we
only maintain one value function $V$ and one policy $\pi$. $V_G$
and $\pi_G$ are implicitly defined by $G$ and the masking operation.

Symbolic LAO* is summarized in Table 1. In the following,
we give a more detailed explanation.

Policy expansion In the policy expansion step of the algorithm,
we perform reachability analysis to find the set of
states $F$ that are not in $G$ (i.e., have not been “expanded”
yet), but are reachable from the set of start states, $S^0$, by following
the partial policy $\pi_G$. These states are on the “fringe”
of the states visited by the best policy. We add them to $G$ and
to the set of states $E \subseteq G$ that are visited by the current partial
policy. This is analogous to “expanding” states on the
frontier of a search graph in heuristic search. Expanding a 


partial policy means that it will be defined for a larger set of 
states in the dynamic-programming step. 

Symbolic reachability analysis using decision diagrams is 
widely used in VLSI design and verification. Our policy- 
expansion algorithm is similar to the traversal algorithms 
used for sequential verification, but is adapted to handle the 
more complex system dynamics of an MDP. The key op- 
eration in reachability analysis is computation of the image 
of a set of states, given a transition function. The image 
is the set of all possible successor states. To perform this 
operation, it is convenient to convert the ADD $P^a(X, X')$
to a BDD $T^a(X, X')$ that maps state transitions to a value
of one if the transition has a non-zero probability, and otherwise
zero. The image computation is faster using BDDs
than ADDs. Mathematically, the image is computed using
the relational-product operator, defined as follows:

$$\mathrm{Image}_{X'}(S, T^a) = \exists X \, [T^a(X, X') \wedge \chi_S(X)].$$

The conjunction $T^a(X, X') \wedge \chi_S(X)$ selects the set of valid
transitions and the existential quantification extracts and 
unions the successor states together. Both the relational- 
product operator and symbolic traversal algorithms are well 
studied in the symbolic model checking literature, and we 
refer to that literature for details about how this is computed, 
for example, (Somenzi 1999). 
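
The relational product can be illustrated with explicit sets in place of BDDs. The sketch below is a non-symbolic stand-in written for this explanation: the names `to_relation` and `image`, the state labels, and the probabilities are all hypothetical, and a symbolic implementation would perform the same operation on decision diagrams without enumerating transitions.

```python
def to_relation(P):
    """Convert transition probabilities to the 0/1 relation T^a.

    P maps (state, next_state) pairs to probabilities; a pair is kept
    only if its probability is non-zero.
    """
    return {(s, t) for (s, t), p in P.items() if p > 0}


def image(S, T):
    """Relational product: all t such that some s in S has (s, t) in T."""
    return {t for (s, t) in T if s in S}


# Hypothetical transition model for one action.
P_a = {
    ("s0", "s1"): 0.7,
    ("s0", "s2"): 0.3,
    ("s1", "s1"): 1.0,
    ("s2", "s0"): 0.0,  # zero-probability entry, dropped by to_relation
}
T_a = to_relation(P_a)
successors = image({"s0"}, T_a)
```

The membership test `s in S` plays the role of the conjunction with $\chi_S(X)$, and collecting the `t` values plays the role of the existential quantification over $X$.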

The image operator returns a characteristic function over 
X' that represents the set of reachable states after an action 
is taken. The assignment in line 4 implicitly converts this 
characteristic function so that it is defined over X, and rep- 
resents the current set of states ready for the next expansion. 

Because a policy is associated with a set of transition 
functions, one for each action, we need to invoke the appro- 
priate transition function for each action when computing 
successor states under a policy. For this, it is useful to represent
the partial policy $\pi_G$ in another way. We associate with
each action $a$ the set of states for which the action to take
is $a$ under the current policy, and call this set of states $S^a_\pi$.
Note that $S^a_\pi \cap S^{a'}_\pi = \emptyset$ for $a \neq a'$, and $\bigcup_a S^a_\pi = G$. Given
this alternative representation of the policy, line 4 computes 
the set of successor states following the current policy using 
the image operator. 
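
The policy-expansion loop of Table 1 (lines 1 to 11) can be sketched with explicit sets as follows. This is an illustrative, non-symbolic rendering under stated assumptions: the policy is a hypothetical dictionary from states to actions, each action's transition relation is a set of (state, successor) pairs, and the example data are invented for the sketch.

```python
def image(S, T):
    """All successors t of states s in S under the relation T."""
    return {t for (s, t) in T if s in S}


def policy_expansion(policy, start, G, T):
    """Explicit-set version of policyExpansion in Table 1.

    policy: dict state -> action (the partial policy)
    start:  set of start states S^0
    G:      set of states expanded so far
    T:      dict action -> set of (state, successor) pairs
    """
    E, F = set(), set()
    frm = set(start)
    while frm:
        to = set()
        for a, rel in T.items():
            # S^a_pi restricted to `frm`: states whose policy action is a.
            S_a = {s for s in frm if policy.get(s) == a}
            to |= image(S_a, rel)
        F |= to - G            # newly reached, not yet expanded
        E |= frm               # visited by the current policy
        frm = (to & G) - E     # continue from expanded, unvisited states
    E |= F
    return E, F, G | F


# Hypothetical example: states 0..4, two actions.
T = {
    "a": {(0, 1), (0, 2), (1, 3)},
    "b": {(2, 0), (2, 4), (1, 1)},
}
policy = {0: "a", 1: "a", 2: "b"}
E, F, G = policy_expansion(policy, start={0}, G={0, 1, 2}, T=T)
```

In this example the traversal visits states 0, 1 and 2 under the current policy and discovers the fringe states 3 and 4, which are then added to both E and G.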

Dynamic programming The dynamic-programming step
of LAO* is performed using a modified version of the 
SPUDD algorithm. The original SPUDD algorithm per- 
forms dynamic programming over the entire state space. 
We modify it to focus computation on reachable states, 
using the idea of masking. Masking lets us perform dy- 
namic programming on a subset of the state space instead 
of the entire state space. The pseudocode in Table 1 assumes
that dynamic programming is performed on $E$, the
states visited by the currently best (partial) policy. This 
has been shown to lead to the best performance of LAO*, 
although a larger or smaller set of states can also be up- 
dated (Hansen & Zilberstein 2001). Note that all ADDs 
used in the dynamic-programming computation are masked 
to improve efficiency. 

Because $\pi_G$ is a partial policy, there can be states in $E$
with successor states that are not in $G$, denoted $E'$. This



is true until LAO* converges. In line 13, we identify these 
states so that we can do appropriate masking. To perform 
dynamic programming on the states in $E$, we assign admissible
values to the “fringe” states in $E'$, where these values
come from the current value function. Note that the value 
function is initialized to an admissible heuristic evaluation 
function at the beginning of the algorithm. 

With all components properly masked, we can perform 
dynamic programming using the SPUDD algorithm. This is 
summarized in line 17. The full equation is 

$$V^a(X) = R^a_E(X) + \gamma \sum_{E'} P^a_{E \cup E'}(X, X') \cdot V'_{E'}(X').$$

The masked ADDs $R^a_E$ and $P^a_{E \cup E'}$ need to be computed
only once for each call to valueIteration() since they don’t
change between iterations. Note that the product $P^a_{E \cup E'} \cdot V'_{E'}$
is effectively defined over $E \cup E'$. After the summation
over $E'$, which is accomplished by existentially abstracting
away all post-action variables, the resulting ADD is effectively
defined over $E$ only. As a result, $V^a$ is effectively a
masked ADD over $E$, and the maximum $M$ at line 18 is also
a masked ADD over $E$. In line 19, we use the notation $M_E$
to emphasize that $V$ is set equal to the newly computed values
for $E$ and the saved values for the rest of the state space.
There is no masking computation performed.

The residual in line 20 can be computed by finding the
largest absolute value of the ADD $(V_E - V'_E)$. We use
the masking subscript here to emphasize that the residual
is computed only for states in the set $E$. The masking operation
can actually be avoided here since at this step, $V_E = M$,
which is computed in line 18, and $V'_E$ is the $M$ from the previous
iteration.

Dynamic programming is the most expensive step of 
LAO*, and it is usually not efficient to run it until conver- 
gence each time this step is performed. Often a single itera- 
tion gives the best performance. After performing value iter- 
ation, we extract a policy in line 22 by comparing M against 
the action value function $V^a$ (breaking ties arbitrarily):

$$\forall s \in E, \quad \pi(s) = a \ \text{ if } \ M(s) = V^a(s).$$
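
The dynamic-programming step (lines 12 to 23 of Table 1) can be sketched with explicit tables as follows. This is a non-symbolic illustration under stated assumptions: dictionaries stand in for masked ADDs, the function name `value_iteration`, the `iterations` parameter and the tiny example model are introduced here, and at least one iteration is assumed.

```python
def value_iteration(E, V, P, R, gamma, iterations=1):
    """Bellman backups restricted to the states in E.

    P[a][s] is a dict {successor: probability} and R[a][s] a reward,
    both needed only for s in E. Values of states outside E (including
    the fringe E') are read from V but never modified.
    """
    V = dict(V)
    Q, residual = {}, 0.0
    for _ in range(iterations):
        V_prev = dict(V)
        # Line 17: action values for states in E only.
        Q = {
            a: {
                s: R[a][s] + gamma * sum(p * V_prev[t] for t, p in P[a][s].items())
                for s in E
            }
            for a in P
        }
        # Line 18: maximize over actions.
        M = {s: max(Q[a][s] for a in Q) for s in E}
        # Line 20: largest change over E.
        residual = max((abs(M[s] - V_prev[s]) for s in E), default=0.0)
        # Line 19: new values on E, saved values elsewhere.
        V.update(M)
    # Line 22: greedy action for each state in E (ties broken arbitrarily).
    policy = {s: max(Q, key=lambda a: Q[a][s]) for s in E}
    return V, policy, residual


# Hypothetical example: s0 is in E, while s1 and s2 are fringe states.
E = {"s0"}
V0 = {"s0": 10.0, "s1": 5.0, "s2": 0.0}
P = {"a": {"s0": {"s1": 1.0}}, "b": {"s0": {"s1": 0.5, "s2": 0.5}}}
R = {"a": {"s0": 1.0}, "b": {"s0": 2.0}}
V1, policy, residual = value_iteration(E, V0, P, R, gamma=0.9)
```

In this example the backup lowers the value of s0 from 10 to 5.5 and selects action a, while the fringe values of s1 and s2 are left untouched.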

The symbolic LAO* algorithm returns a value function
$V$ and a policy $\pi$, together with the set of states $E$ that are
visited by the policy, and the set of states G that have been 
“expanded” by LAO*. 

Convergence test At the beginning of LAO*, the value 
function V is initialized to the admissible heuristic h that 
overestimates the optimal value function. Each time value 
iteration is performed, it starts with the current values of V. 
Hansen and Zilberstein (2001) show that these values de- 
crease monotonically in the course of the algorithm; are al- 
ways admissible; and converge arbitrarily close to optimal. 
LAO* converges to an optimal or $\epsilon$-optimal policy when two
conditions are met: (1) its current policy does not have any 
unexpanded states, and (2) the error bound of the policy is 
less than some predetermined threshold. Like other heuristic 
search algorithms, LAO* can find an optimal solution with- 
out visiting the entire state space. The convergence proofs 
for the original LAO* algorithm carry over in a straightfor- 
ward way to symbolic LAO*. 


Admissible heuristics LAO* uses an admissible heuristic 
to guide the search. Because a heuristic is typically defined 
for all states, a simple way to create an admissible heuris- 
tic is to use dynamic programming to create an approximate 
value function. Given an error bound on the approximation, 
the value function can be converted to an admissible heuris- 
tic. (Another way to ensure admissibility is to perform value 
iteration on an initial value function that is admissible, since 
each step of value iteration preserves admissibility.) Sym- 
bolic dynamic programming can be used to compute an ap- 
proximate value function efficiently. St. Aubin et al. (2000) 
describe an approximate dynamic programming algorithm 
for factored MDPs, called APRICODD, that is based on 
SPUDD. It simplifies the value function ADD by aggregat- 
ing states with similar values. Another approach to approx- 
imate dynamic programming for factored MDPs described 
by Dearden and Boutilier (1997) can also be used to com- 
pute admissible heuristics. 

Use of dynamic programming to compute an admissible 
heuristic points to a two-fold approach to solving factored 
MDPs. First, dynamic programming is used to compute an 
approximate solution for all states that serves as a heuristic. 
Then heuristic search is used to find an exact solution for a 
subset of reachable states. 

Experimental results 

Table 2 compares the performance of LAO* and SPUDD 
on the factory examples (f to f6) used by Hoey et al. (1999) 
to test the performance of SPUDD, as well as some additional
examples (a1 to a4). We use additional test examples
because many of the state variables in the factory examples 
represent resources that cannot be affected by any action. 
As a result, we found that only a small number of states are 
reachable from a given start state in these examples. Examples
a1 to a4 (which are artificial examples) are structured
so that every state variable can be changed by some action, 
and thus, most or all of the state space can be reached from 
any start state. Such examples present a greater challenge to 
a heuristic-search approach. 

Because the performance of LAO* depends on the start 
state, the experimental results reported for LAO* in Table 2 
are averages for 50 random starting states. To create an ad- 
missible heuristic, we performed several iterations (ten for 
the factory examples and twenty for the others) of an approx- 
imate value iteration algorithm similar to APRICODD (St- 
Aubin, Hoey, & Boutilier 2000). The algorithm was started 
with an admissible value function created by assuming the 
maximum reward is received each step. The time used to 
compute the heuristic for these examples is between 2% and 
8% of the running time of SPUDD on the same examples. 
Experiments were performed on a Sun UltraSPARC II with 
a 300MHz processor and 2 gigabytes of memory. 
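
The initial admissible value function mentioned above, obtained by assuming the maximum reward is received at every step, can be written out explicitly. The following is a worked form under the assumption of a discount factor $\gamma < 1$; the symbols $V_0$ and $R_{\max}$ are introduced here for illustration:

```latex
V_0(s) = \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma},
\qquad
R_{\max} = \max_{a \in A} \max_{s \in S} R^a(s).
```

Because no policy can collect more than $R_{\max}$ per step, $V_0(s)$ is an upper bound on the optimal value of every state. For example, with $R_{\max} = 1$ and $\gamma = 0.9$, $V_0(s) = 10$ for all $s$.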

Table 2: Performance comparison of LAO* (both symbolic and non-symbolic) and SPUDD.

LAO* achieves its efficiency by focusing computation on
a subset of reachable states. The column labeled $|E|$ shows
the average number of states visited by an optimal policy,
beginning from a random start state. Clearly, the factory examples
have an unusual structure, since an optimal policy
for these examples visits very few states. The numbers are
much larger for examples a1 through a4. The column labeled
reach shows the average number of states that can be
reached from the start state, by following any policy. The
column labeled $|G|$ is important because it shows the number
of states “expanded” by LAO*. These are states for
which a backup is performed at some point in the algorithm,
and this number depends on the quality of the heuristic. The
better the heuristic, the fewer states need to be expanded to
find an optimal policy. The gap between $|E|$ and reach reflects
the potential for increased efficiency using heuristic
search, instead of simple reachability analysis. The gap between
$|G|$ and reach reflects the actual increased efficiency.

The columns labeled N and L, under LAO* and SPUDD 
respectively, compare the size of the final value function re- 
turned by symbolic LAO* and SPUDD. The columns under 
N give the number of nodes in the respective value function 
ADDs, and the columns under L give the number of leaves. 
Because LAO* focuses computation on a subset of the state 
space, it finds a much more compact solution (which trans- 
lates into increased efficiency). 

The last five columns compare the average running times 
of symbolic LAO* to the running times of non-symbolic 
LAO* and SPUDD. Times are given in CPU seconds. For 
many of these examples, the MDP model is too large to 
represent explicitly. Therefore, our implementation of non- 
symbolic LAO* uses the same ADD representation of the 
MDP model as symbolic LAO* and SPUDD. However, non- 
symbolic LAO* performs heuristic search in the conven- 
tional way by creating a search graph in which the nodes 
correspond to “flat” states that are enumerated individually. 

The total running time of symbolic LAO* is broken down 
into two parts; the column “exp.” shows the average time 
for policy expansion and the column “DP” shows the av- 
erage time for dynamic programming. These results show 
that dynamic programming consumes most of the running 
time. This is in keeping with a similar observation about the 
original (non-symbolic) LAO* algorithm. The time reported 
for dynamic programming includes the time for masking. 
For this set of examples, masking takes between 0.5% and 
2.1% of the running time of the dynamic programming step. 


The last three columns compare the average time it takes 
symbolic LAO* to solve each problem, for 50 random start- 
ing states, to the running times of non-symbolic LAO* and 
SPUDD. This comparison leads to several observations. 

First, we note that the running time of non-symbolic 
LAO* is correlated with |G|, the number of states evaluated 
(i.e., expanded) during the search, which in turn is affected 
by the starting state, the reachability structure of the prob- 
lem, and the accuracy of the heuristic function. As |G| in- 
creases, the running time of non-symbolic LAO* increases. 
The search graphs for examples a3 and a4 are so large that 
these problems cannot be solved using non-symbolic LAO*. 
(NA indicates that the problem could not be solved.) 

The running time of symbolic LAO* depends not only 
on |G|, but on the degree of state abstraction the symbolic 
approach achieves in representing the states in G. For the 
factory examples and example a1, the number of states evaluated
by LAO* is small enough that the overhead of symbolic
search outweighs the improved efficiency from state
abstraction. For these examples, symbolic LAO* is some- 
what slower than non-symbolic LAO*. But for examples 
a2 to a4, the symbolic approach significantly - and some- 
times dramatically - improves the performance of LAO*. 
Symbolic LAO* also outperforms SPUDD for all examples. 
This is to be expected since LAO* solves the problem for 
only part of the state space. Nevertheless, it demonstrates 
the power of using heuristic search to focus computation on 
relevant states. 

We conclude by noting that examples a3 and a4 are be- 
yond the range of both SPUDD and non-symbolic LAO*, 
or can only be solved with great difficulty. Yet symbolic 
LAO* solves both examples efficiently. This illustrates the 
advantage of combining heuristic search and state abstrac- 
tion, rather than relying on either approach alone. 

Related work 

As noted in the introduction, symbolic model checking 
techniques have been used previously for nondeterministic 
planning. In both nondeterministic and decision-theoretic 
planning, plans may contain cycles that represent iterative,
or “trial-and-error,” strategies. In nondeterministic planning,
the concept of a strong cyclic plan plays a central
role (Cimatti, Roveri, & Traverso 1998; Daniele, Traverso,
& Vardi 1999). It refers to a plan that contains an iterative 
strategy and is guaranteed to eventually achieve the goal. 
The concept of a strong cyclic plan has an interesting anal- 
ogy in decision-theoretic planning. LAO* was originally de- 
veloped for the framework of stochastic shortest-path prob- 
lems. A stochastic shortest-path problem is an MDP with 
a goal state, where the objective is to find an optimal pol- 
icy (usually containing cycles) among policies that reach the 
goal state with probability one. A policy that reaches the 
goal with probability one, also called a proper policy, can
be viewed as a probabilistic generalization of the concept of 
a strong cyclic plan. In this respect and others, the symbolic 
LAO* algorithm presented in this paper can be viewed as a 
decision-theoretic generalization of symbolic algorithms for 
nondeterministic planning. 

One difference is that the algorithm presented in this 
paper uses heuristic search to limit the number of states 
for which a policy is computed. An integration of sym- 
bolic model checking with heuristic search has not yet 
been explored for nondeterministic planning. However, 
Edelkamp & Reffel (1998) describe a symbolic generalization
of A* that combines symbolic model checking and
heuristic search in solving deterministic planning problems. 
A combined approach has also been explored for confor- 
mant planning (Bertoli, Cimatti, & Roveri 2001). 

In motivation, our work is closely related to the framework
of structured reachability analysis, which exploits
reachability analysis in solving factored MDPs (Boutilier, 
Brafman, & Geib 1998). However, there are important dif- 
ferences. The symbolic model-checking techniques we use 
differ from the approach to state abstraction used in that 
work, which is derived from GRAPHPLAN. More impor- 
tantly, their concept of reachability analysis is weaker than 
the approach adopted here. In their framework, states are 
considered irrelevant if they cannot be reached from the start 
state by following any policy. By contrast, our approach 
considers states irrelevant if it can be proved (by gradually 
expanding a partial policy guided by an admissible heuristic) 
that these states cannot be reached from the start state by fol- 
lowing an optimal policy. Use of an admissible heuristic to 
limit the search space is characteristic of heuristic search, in 
contrast to simple reachability analysis. As Table 2 shows, 
LAO* evaluates much less of the state space than simple 
reachability analysis. The better the heuristic, the smaller 
the number of states it examines. 

Conclusion 

We have described a symbolic generalization of LAO* that 
solves factored MDPs using heuristic search. Given a start 
state, LAO* uses an admissible heuristic to focus computa- 
tion on the parts of the state space that are reachable from 
the start state. The stronger the heuristic, the greater the fo- 
cus and the more efficient a planner based on this approach. 
Symbolic LAO* also exploits state abstraction using sym- 
bolic model checking techniques. It can be viewed as a 


decision-theoretic generalization of symbolic approaches to 
nondeterministic planning. 

To solve very large MDPs, we believe that decision- 
theoretic planners will need to employ a collection of com- 
plementary strategies that exploit different forms of problem 
structure. Showing how to combine heuristic search with 
symbolic techniques for state abstraction is a step in this 
direction. Integrating additional strategies into a decision- 
theoretic planner is a topic for future research. 